Score prediction




Iterative Self-Improvement of Vision Language Models for Image Scoring and Self-Explanation

Tanji, Naoto, Yamasaki, Toshihiko

arXiv.org Artificial Intelligence

ABSTRACT Image scoring is a crucial task in numerous real-world applications. To trust a model's judgment, understanding its rationale is essential. This paper proposes a novel training method for Vision Language Models (VLMs) to generate not only image scores but also corresponding justifications in natural language. Leveraging only an image scoring dataset and an instruction-tuned VLM, our method enables self-training, utilizing the VLM's generated text without relying on external data or models. In addition, we introduce a simple method for creating a dataset designed to improve alignment between predicted scores and their textual justifications. By iteratively training the model with Direct Preference Optimization on two distinct datasets and merging the resulting models, we can improve both scoring accuracy and the coherence of generated explanations. Index Terms: Vision language model, Explainable AI, Image scoring, Self-training, Direct Preference Optimization. 1. INTRODUCTION Deep learning is revolutionizing image analysis, enabling automated classification and scoring with enhanced accuracy and efficiency. Examples include disease detection in medical images, defect identification in quality control, and predicting advertising effectiveness.
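A minimal sketch of how the score/explanation alignment data described above might be constructed. The "Score:" marker format, the tolerance, and the pairing rule are illustrative assumptions, not the paper's exact recipe: generations whose parsed score is close to the ground truth are preferred over the rest.

```python
import re

def build_preference_pairs(samples, true_score, tol=0.5):
    """Split VLM generations into chosen/rejected pairs for DPO.

    `samples` is a list of generated texts, each expected to contain a
    line like "Score: 7.5" followed by a justification. A generation is
    "chosen" when its parsed score is within `tol` of the ground-truth
    score, and "rejected" otherwise.
    """
    chosen, rejected = [], []
    for text in samples:
        m = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", text)
        if m and abs(float(m.group(1)) - true_score) <= tol:
            chosen.append(text)
        else:
            rejected.append(text)
    # Pair every accurate generation with every inaccurate one.
    return [(c, r) for c in chosen for r in rejected]

pairs = build_preference_pairs(
    ["Score: 7.5\nSharp focus and good lighting.",
     "Score: 3.0\nBlurry subject."],
    true_score=7.0)
```

Pairs built this way can feed a standard DPO objective, which is then iterated as the abstract describes.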


SC4ANM: Identifying Optimal Section Combinations for Automated Novelty Prediction in Academic Papers

Wu, Wenqing, Zhang, Chengzhi, Bao, Tong, Zhao, Yi

arXiv.org Artificial Intelligence

Novelty is a core component of academic papers, and there are multiple perspectives on the assessment of novelty. Existing methods often focus on word or entity combinations, which provide limited insights. The content related to a paper's novelty is typically distributed across different core sections, e.g., Introduction, Methodology and Results. Therefore, exploring the optimal combination of sections for evaluating the novelty of a paper is important for advancing automated novelty assessment. In this paper, we utilize different combinations of sections from academic papers as inputs to drive language models to predict novelty scores. We then analyze the results to determine the optimal section combinations for novelty score prediction. We first employ natural language processing techniques to identify the sectional structure of academic papers, categorizing them into introduction, methods, results, and discussion (IMRaD). Subsequently, we use different combinations of these sections (e.g., introduction and methods) as inputs for pretrained language models (PLMs) and large language models (LLMs), employing novelty scores provided by human expert reviewers as ground truth labels to obtain prediction results. The results indicate that using the introduction, results, and discussion sections is most appropriate for assessing the novelty of a paper, while the use of the entire text does not yield significant results. Furthermore, based on the results of the PLMs and LLMs, the introduction and results appear to be the most important sections for the task of novelty score prediction. The code and dataset for this paper can be accessed at https://github.com/njust-winchy/SC4ANM.
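The experimental grid implied above can be sketched simply: enumerate every non-empty IMRaD subset and concatenate the chosen sections into one model input. The function names and the join format are illustrative, not taken from the SC4ANM code.

```python
from itertools import combinations

SECTIONS = ["introduction", "methods", "results", "discussion"]

def section_combinations():
    """Enumerate all non-empty IMRaD section combinations to compare
    as model inputs for novelty score prediction."""
    combos = []
    for r in range(1, len(SECTIONS) + 1):
        combos.extend(combinations(SECTIONS, r))
    return combos  # 2^4 - 1 = 15 combinations

def build_input(paper, combo):
    """Concatenate the chosen sections of a paper (a dict mapping
    section name to text) into a single model input."""
    return "\n\n".join(paper[s] for s in combo)
```

Each combination's input is scored against the expert labels, and the best-performing subset (here, introduction + results + discussion) is reported.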


Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content?

Qian, Shenbin, Orăsan, Constantin, Kanojia, Diptesh, Carmo, Félix do

arXiv.org Artificial Intelligence

This paper investigates whether large language models (LLMs) are state-of-the-art quality estimators for machine translation of user-generated content (UGC) that contains emotional expressions, without the use of reference translations. To achieve this, we employ an existing emotion-related dataset with human-annotated errors and calculate quality evaluation scores based on the Multidimensional Quality Metrics (MQM) framework. We compare the accuracy of several LLMs with that of our fine-tuned baseline models, under in-context learning and parameter-efficient fine-tuning (PEFT) scenarios. We find that PEFT of LLMs leads to better performance in score prediction, with human-interpretable explanations, than fine-tuned models. However, a manual analysis of LLM outputs reveals that they still have problems such as refusing to reply to a prompt and producing unstable output while evaluating machine translation of UGC.
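The MQM-based scoring mentioned above amounts to a weighted error penalty normalised by segment length. The severity weights and the normalisation below are common illustrative defaults, not necessarily the exact scheme used in the paper.

```python
def mqm_score(errors, word_count, weights=None):
    """Compute a reference-free MQM-style quality score from annotated
    errors. `errors` maps severity ("minor"/"major"/"critical") to a
    count. The weights and the per-word normalisation are illustrative
    defaults; the paper's exact scoring scheme may differ.
    """
    weights = weights or {"minor": 1, "major": 5, "critical": 10}
    penalty = sum(weights[sev] * n for sev, n in errors.items())
    # Normalise per word so segments of different length are comparable,
    # and clamp at zero for heavily penalised segments.
    return max(0.0, 1.0 - penalty / word_count)

score = mqm_score({"minor": 2, "major": 1}, word_count=70)  # ≈ 0.9
```

Scores computed this way from the human error annotations serve as the regression targets that the LLMs and baselines try to predict.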


Exploring Kolmogorov-Arnold networks for realistic image sharpness assessment

Yu, Shaode, Chen, Ze, Yang, Zhimu, Gu, Jiacheng, Feng, Bizu

arXiv.org Artificial Intelligence

Score prediction is crucial in realistic image sharpness assessment after informative features are collected. Recently, Kolmogorov-Arnold networks (KANs) have been developed and have witnessed remarkable success in data fitting. This study presents a Taylor-series-based KAN (TaylorKAN). Different KANs are then explored on four realistic image databases (BID2011, CID2013, CLIVE, and KonIQ-10k) for score prediction, using 15 mid-level features and 2048 high-level features. With support vector regression as the baseline, experimental results indicate that KANs are generally competitive or better; TaylorKAN is the best on three databases when using mid-level features, while KANs are inferior on CLIVE when high-level features are used. This is the first study that explores KANs for image quality assessment. It sheds light on how to select and improve KANs on related tasks.
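The core idea of a Taylor-series KAN can be sketched in a few lines: each edge of the network carries a learnable univariate function expressed as a truncated Taylor expansion, and each output sums the edge functions of its inputs. This is a sketch of the idea only; TaylorKAN's exact parameterisation may differ.

```python
def taylor_edge(x, coeffs, x0=0.0):
    """Evaluate a learnable univariate edge function as a truncated
    Taylor series around x0: sum_k c_k * (x - x0)**k. In a KAN, one
    such function sits on every edge; the coefficients are trained."""
    return sum(c * (x - x0) ** k for k, c in enumerate(coeffs))

def kan_layer(inputs, coeff_table, x0=0.0):
    """One KAN layer: each output sums edge functions of every input.
    coeff_table[j][i] holds the Taylor coefficients for the edge from
    input i to output j."""
    return [sum(taylor_edge(x, coeff_table[j][i], x0)
                for i, x in enumerate(inputs))
            for j in range(len(coeff_table))]

# f(x) = 1 + 2x on each of two edges feeding one output:
y = kan_layer([1.0, 2.0], [[[1.0, 2.0], [1.0, 2.0]]])  # [8.0]
```

The Taylor basis replaces the B-spline basis of the original KAN formulation; stacking such layers and fitting the coefficients by gradient descent yields the score regressor.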


Reading ability detection using eye-tracking data with LSTM-based few-shot learning

Li, Nanxi, Wang, Hongjiang, Zhan, Zehui

arXiv.org Artificial Intelligence

Previous works demonstrated that eye-tracking data supply meaningful information for reading ability detection and have achieved promising results with machine learning methods [1-18]. Eye-tracking-based methods of reading ability detection fall into two main categories: one estimates reading ability over a finite number of classes [1-14], providing a qualitative evaluation of subjects' reading ability; the other predicts reading ability scores with regression models [15-18], providing a quantitative evaluation. Although the former exhibits satisfactory accuracy in detecting certain classes of reading abnormalities, it lacks the capability to predict exact reading-ability scores, which is emphasized in highly interactive educational environments (such as online learning) for making personalized and intelligent responses to subjects. However, precise score prediction of reading ability from eye-tracking data is not easy [15-18], especially when the sample data of subjects are few. In this paper, using a few-shot learning strategy, a regression model for score prediction is proposed that combines Long Short-Term Memory (LSTM) [19] and lightweight neural networks. The proposed model exhibits higher accuracy than previous score prediction models tested on the same dataset.
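The LSTM-plus-lightweight-network regressor described above can be sketched at its smallest scale: a single-unit LSTM consumes the eye-tracking sequence and a linear head maps the final hidden state to a score. All parameter names here are illustrative scalars; the paper's architecture is larger and its features are multidimensional.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_score(sequence, p):
    """Minimal single-unit LSTM followed by a linear head. `p` holds
    scalar weights (illustrative, not the paper's actual parameters):
    each gate g has input weight p['w_g'], recurrent weight p['u_g'],
    and bias p['b_g'].
    """
    h = c = 0.0
    for x in sequence:  # e.g. one fixation-derived feature per time step
        i = sigmoid(p["w_i"] * x + p["u_i"] * h + p["b_i"])   # input gate
        f = sigmoid(p["w_f"] * x + p["u_f"] * h + p["b_f"])   # forget gate
        o = sigmoid(p["w_o"] * x + p["u_o"] * h + p["b_o"])   # output gate
        g = math.tanh(p["w_g"] * x + p["u_g"] * h + p["b_g"])  # candidate
        c = f * c + i * g
        h = o * math.tanh(c)
    # Linear head maps the final hidden state to a reading-ability score.
    return p["w_out"] * h + p["b_out"]
```

In a few-shot setting, the small parameter count of such a head is what keeps the regression trainable from limited subject data.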


Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs

Thelwall, Mike

arXiv.org Artificial Intelligence

Evaluating the quality of academic journal articles is a time-consuming but critical task for national research evaluation exercises, appointments, and promotions. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process. This article assesses which ChatGPT inputs (full text without tables, figures and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts. The results show that the optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66) and 4o-mini (0.66). The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones. Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into the human scale scores, which is 31% more accurate than guessing.
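The calibration step in the final sentence is ordinary least squares on a single predictor: fit human scores against model scores, then apply the fitted line to new model scores. A minimal sketch with illustrative data:

```python
def fit_line(model_scores, human_scores):
    """Ordinary least squares fit of human scores on model scores,
    returning (slope, intercept) so that new model scores can be
    mapped onto the human scale."""
    n = len(model_scores)
    mx = sum(model_scores) / n
    my = sum(human_scores) / n
    sxx = sum((x - mx) ** 2 for x in model_scores)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(model_scores, human_scores))
    slope = sxy / sxx
    return slope, my - slope * mx

# Illustrative data, not the paper's: map model scores onto human scores.
slope, intercept = fit_line([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
calibrated = slope * 5.0 + intercept  # a new model score of 5.0 -> 2.5
```

Averaging over 30 iterations before calibrating, as the article does, reduces the variance of the model-score predictor.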


Scoreformer: A Surrogate Model For Large-Scale Prediction of Docking Scores

Ciudad, Álvaro, Morales-Pastor, Adrián, Malo, Laura, Filella-Mercè, Isaac, Guallar, Victor, Molina, Alexis

arXiv.org Artificial Intelligence

In this study, we present ScoreFormer, a novel graph transformer model designed to accurately predict molecular docking scores, thereby optimizing high-throughput virtual screening (HTVS) in drug discovery. The architecture integrates Principal Neighborhood Aggregation (PNA) and Learnable Random Walk Positional Encodings (LRWPE), enhancing the model's ability to understand complex molecular structures and their relationship with their respective docking scores. This approach significantly surpasses traditional HTVS methods and recent Graph Neural Network (GNN) models in both recovery and efficiency due to a wider coverage of the chemical space and enhanced performance. Our results demonstrate that ScoreFormer achieves competitive performance in docking score prediction and offers a substantial 1.65-fold reduction in inference time compared to existing models. We evaluated ScoreFormer across multiple datasets under various conditions, confirming its robustness and reliability in identifying potential drug candidates rapidly.
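The Principal Neighborhood Aggregation component mentioned above combines several neighbour aggregators, each modulated by degree-dependent scalers. The sketch below follows the published PNA idea with scalar features; the exact aggregators and scalers ScoreFormer uses are not specified in the abstract.

```python
import math

def pna_aggregate(neighbor_feats, degree, avg_log_degree=1.0):
    """Principal Neighborhood Aggregation (sketch): combine four
    aggregators (mean, max, min, std) of a node's neighbour features
    with three degree scalers (identity, amplification, attenuation),
    yielding a 12-dimensional node update input."""
    n = len(neighbor_feats)
    mean = sum(neighbor_feats) / n
    var = sum((x - mean) ** 2 for x in neighbor_feats) / n
    aggs = [mean, max(neighbor_feats), min(neighbor_feats), math.sqrt(var)]
    # Log-degree scaler, normalised by the training-set average.
    s = math.log(degree + 1) / avg_log_degree
    scalers = [1.0, s, 1.0 / s if s > 0 else 1.0]
    # Cartesian product of aggregators and scalers -> 4 * 3 = 12 features.
    return [a * k for a in aggs for k in scalers]

feats = pna_aggregate([1.0, 3.0], degree=2)
```

Using multiple aggregators lets the network distinguish neighbourhoods that a single mean or max would conflate, which matters for molecular graphs with similar local statistics.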


Connected Speech-Based Cognitive Assessment in Chinese and English

Luz, Saturnino, Garcia, Sofia De La Fuente, Haider, Fasih, Fromm, Davida, MacWhinney, Brian, Lanzi, Alyssa, Chang, Ya-Ning, Chou, Chia-Ju, Liu, Yi-Chien

arXiv.org Artificial Intelligence

We present a novel benchmark dataset and prediction tasks for investigating approaches to assess cognitive function through analysis of connected speech. The dataset consists of speech samples and clinical information for speakers of Mandarin Chinese and English with different levels of cognitive impairment as well as individuals with normal cognition. These data have been carefully matched by age and sex through propensity score analysis to ensure balance and representativeness in model training. The prediction tasks encompass mild cognitive impairment diagnosis and cognitive test score prediction. This framework was designed to encourage the development of approaches to speech-based cognitive assessment which generalise across languages. We illustrate it by presenting baseline prediction models that employ language-agnostic and comparable features for diagnosis and cognitive test score prediction. The models achieved an unweighted average recall of 59.2% in diagnosis and a root mean squared error of 2.89 in score prediction.
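The two metrics reported above are standard and easy to state precisely: unweighted average recall (the mean of per-class recalls, robust to class imbalance between impaired and control speakers) for diagnosis, and root mean squared error for test-score regression.

```python
import math

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls: each class contributes equally
    regardless of how many examples it has."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(classes)

def rmse(y_true, y_pred):
    """Root mean squared error for cognitive test score prediction."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))
```

For example, an RMSE of 2.89 means predicted test scores deviate from the true scores by about 2.89 points on the test's scale, on a root-mean-square basis.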